LOADING THE PACKAGES
# Data Manipulation and Representation
import pandas as pd
import numpy as np
import sqlite3
import chardet
# Statistical Analysis
import scipy.stats as stats
from scipy.stats import ttest_rel, f_oneway
import statsmodels.api as sm
import pingouin as pg
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Web Interaction and Display
from IPython.display import Image, display, HTML
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Miscellaneous
import warnings
warnings.filterwarnings("ignore")
# Additional JavaScript for toggling code display in Jupyter Notebooks
HTML(
"""
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js"></script>
<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
} else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit"
value="Click here to toggle on/off the raw code."></form>
"""
)
ABSTRACT
This study explores the relationship between Gross Domestic Product (GDP) and CO2 emissions across various income brackets, testing the Environmental Kuznets Curve (EKC) hypothesis. Utilizing regression models and statistical analysis, it was found that high-income countries have significantly higher CO2 emissions per capita, indicating a strong correlation between economic prosperity and environmental impact. The analysis reveals that GDP is a crucial predictor of CO2 emissions in both developed and developing nations, with a more pronounced effect in the latter. These findings suggest that economic growth alone may not lead to environmental improvement, challenging the inevitability of the EKC hypothesis. Based on these insights, the study recommends the implementation of differential carbon taxes based on income brackets, the promotion of technology transfer to developing countries, and the establishment of international standards for sustainable development. These policies aim to align economic development with environmental sustainability, addressing the nuanced dynamics of global economic and environmental interactions.
INTRODUCTION
Background
A theory relevant to this subject is the Environmental Kuznets Curve (EKC). The EKC posits that economic progress initially results in environmental deterioration, but that once a certain threshold of economic growth is reached, society begins to improve its relationship with the environment and environmental degradation declines.
This hypothesis faces substantial criticism because there is no guarantee that economic expansion will lead to an improved environment. For most developed countries the opposite is often the case; at the very least, targeted policies and attitudes are required to make economic growth compatible with an improving environment.
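The inverted-U shape that the EKC describes is often written as a quadratic in log income. The sketch below fits such a curve and locates its turning point. It uses synthetic, noise-free numbers rather than this study's data, and the coefficients, variable names, and the quadratic-in-logs specification are all assumptions made for illustration.

```python
import numpy as np

# Synthetic, noise-free illustration (not this study's data):
# log emissions per capita as an inverted U in log GDP per capita.
log_gdp = np.linspace(6.0, 12.0, 50)
true_b0, true_b1, true_b2 = -40.0, 9.0, -0.5  # assumed coefficients
log_co2 = true_b0 + true_b1 * log_gdp + true_b2 * log_gdp**2

# Fit a second-degree polynomial; np.polyfit returns the highest power first.
b2, b1, b0 = np.polyfit(log_gdp, log_co2, deg=2)

# An inverted U requires b2 < 0; the turning point is where the slope is zero.
turning_point_log_gdp = -b1 / (2.0 * b2)
turning_point_gdp = float(np.exp(turning_point_log_gdp))

print(f"quadratic coefficient: {b2:.3f}")
print(f"turning point at log GDP per capita: {turning_point_log_gdp:.2f}")
```

A negative quadratic coefficient is necessary for the EKC shape but not sufficient: the turning point also has to fall inside the observed income range to be meaningful.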
Problem Statement
This study aims to explore the relationship between GDP and CO2 emissions across different income brackets, focusing on the validity of the Environmental Kuznets Curve (EKC) hypothesis. It investigates whether economic growth leads to increased environmental degradation, as represented by CO2 emissions, before a turning point is reached where further economic development corresponds with environmental improvement. Through statistical analysis and regression models, the research seeks to discern the patterns of CO2 emissions in developed and developing countries, thereby shedding light on the global dynamics of economic development and environmental sustainability.
Objectives
The following are the objectives of this study.
- Understand the Relationship Between GDP Growth and CO2 Emissions: Evaluate the correlation between economic growth (GDP) and changes in CO2 emissions in the context of the Environmental Kuznets Curve hypothesis.
- Assess the Environmental Impact of COVID-19: Investigate the statistical significance of the COVID-19 pandemic's impact on global CO2 emissions and the GDP of various countries.
- Analyze Disparities in GDP Per Capita Across Income Brackets: Examine variances in GDP per capita among countries from different income brackets and their relation to CO2 emissions.
- Develop Predictive Models for CO2 Emissions: Construct and refine linear regression models to predict CO2 emissions, incorporating economic indicators, and optimize these models through feature selection.
Ultimately, drawing on the insights gained from these objectives, the team aims to offer recommendations for international policy. These recommendations are intended to help policymakers balance economic growth with environmental sustainability, in line with the Paris Agreement goals.
Methodology
In this statistics case project, our methodology focuses on analyzing environmental and economic data. We start by processing data, selecting specific countries, merging datasets, and preparing them for Isolated Regression Models. Next, we create a user-friendly pipeline for regression model functions. Our statistical analysis aims to answer key questions about the Earth's healing, COVID-19's impact on GDP, and GDP per capita variances across income brackets, using T-tests and ANOVA for hypothesis testing. Finally, we conduct Linear Regression Analysis to predict CO2 emissions, involving model development, feature selection, and refinement, complemented by plots of residuals, observed vs. fitted values, and Normal Q-Q plots.
Step-by-Step Process:
Data Processing: Select specific countries for inclusion in the study. Merge various datasets to create a comprehensive and unified dataset. Prepare the data for application in Isolated Regression Models.
Pipeline Creation: Develop a pipeline that facilitates the ease of use of the required functions for regression analysis.
Statistical Analysis: Perform hypothesis testing, including T-Tests (Related Samples) and ANOVA. Address key questions:
- Is Earth healing?
- Did COVID-19 significantly impact the GDP of all countries?
- Is there a significant variance in GDP per capita across different income brackets?
Linear Regression Analysis:
- Develop a full model to predict CO2 emissions.
- Conduct feature selection to optimize and refine the model.
- Generate an improved model based on the selected features.
- Visualize the analysis through plots for residuals, observed vs. fitted values, and the Normal Q-Q Plot.
- Develop a full model for developed countries and another for developing countries, then repeat the steps used for the all-countries model.
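The two hypothesis tests named in the steps above can be sketched as follows. The arrays are illustrative values only, not figures from Data.csv, and the significance level `alpha` is an assumed choice.

```python
import numpy as np
from scipy.stats import ttest_rel, f_oneway

# Illustrative values only (not taken from Data.csv).
emissions_2019 = np.array([100.0, 250.0, 80.0, 400.0, 150.0])
emissions_2020 = np.array([92.0, 240.0, 79.0, 370.0, 141.0])

# Paired (related-samples) t-test: the same units measured in two years.
t_stat, p_paired = ttest_rel(emissions_2019, emissions_2020)

# One-way ANOVA: compares the mean of one variable across independent groups.
high = np.array([45.0, 52.0, 48.0, 60.0])
upper_middle = np.array([9.0, 12.0, 8.5, 11.0])
lower_middle = np.array([3.0, 2.5, 4.1, 3.6])
f_stat, p_anova = f_oneway(high, upper_middle, lower_middle)

alpha = 0.05  # assumed significance level
print(f"paired t-test: t = {t_stat:.3f}, p = {p_paired:.4f}")
print(f"one-way ANOVA: F = {f_stat:.3f}, p = {p_anova:.4g}")
print("ANOVA rejects equal means:", p_anova < alpha)
```

The paired test is the appropriate choice for the 2019 versus 2020 comparison because each country appears in both years; ANOVA fits the income-bracket comparison because the brackets are disjoint groups.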
DATA DESCRIPTION
The Data.csv is a tabular data structure with several columns, including "Country Name," "Country Code," "Series Name," "Series Code," and data for the years 2019 and 2020. It contains information related to carbon dioxide (CO2) emissions, Gross Domestic Product (GDP), population, and urban land area for various countries. Here's a breakdown of the columns:
| Column Name | Description |
|---|---|
| Country Name | The name of the country. |
| Country Code | A unique code assigned to each country. |
| Series Name | Descriptive name of the economic or environmental indicator. |
| Series Code | A code associated with the series, possibly used for identification. |
| 2019 [YR2019] | Data for the year 2019 related to the specified series. |
| 2020 [YR2020] | Data for the year 2020 related to the specified series. |
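Because Data.csv stores one row per country and series, most calculations need it reshaped so that each series becomes a column. A minimal sketch of that reshaping is shown below, using the 2019 values for Afghanistan and Algeria that appear in the notebook outputs; the derived column name `CO2 per capita (t)` is an assumption for illustration.

```python
import pandas as pd

# Long layout: one row per (country, series), as in Data.csv.
rows = [
    ("Afghanistan", "CO2 emissions (kt)", 11238.83),
    ("Afghanistan", "GDP (constant 2015 US$)", 22071985906.2168),
    ("Afghanistan", "Population, total", 37769499),
    ("Algeria", "CO2 emissions (kt)", 170582.4),
    ("Algeria", "GDP (constant 2015 US$)", 177355540257.925),
    ("Algeria", "Population, total", 42705368),
]
long_df = pd.DataFrame(
    rows, columns=["Country Name", "Series Name", "2019 [YR2019]"]
)

# Wide layout: one row per country, one column per series.
wide = long_df.pivot(
    index="Country Name", columns="Series Name", values="2019 [YR2019]"
)

# Emissions are reported in kilotonnes, so multiply by 1000 for tonnes per person.
wide["CO2 per capita (t)"] = (
    wide["CO2 emissions (kt)"] * 1000 / wide["Population, total"]
)
print(wide)
```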
The mapper_df.csv dataset consists of two columns: "Country Name" and "Income Bracket." The "Country Name" column contains the names of various countries, while the "Income Bracket" column categorizes each country into income brackets using abbreviations such as "L" (Low), "LM" (Lower Middle), "UM" (Upper Middle), and "H" (High). Here's a breakdown of the columns:
| Column Name | Description |
|---|---|
| Country Name | The name of the country. |
| Income Bracket | The income bracket classification assigned to each country. |
The renewable.csv is a table containing information on global electricity generation from various sources for the years 1971 to 2021. Here's a breakdown of the columns:
| Column Name | Description |
|---|---|
| Entity | This column represents the geographical or political entity for which the electricity generation data is reported. It includes both individual regions like "Africa" and a global aggregate labelled as "World." |
| Code | This column contains a code or identifier associated with each entity. |
| Year | This column indicates the corresponding year for which the electricity generation data is recorded. The data spans from 1971 to 2021. |
| Geo Biomass Other - TWh | This column represents electricity generation in terawatt-hours from sources categorized as "Geo Biomass Other." |
| Solar Generation - TWh | This column represents electricity generation in terawatt-hours from solar sources. |
| Wind Generation - TWh | This column represents electricity generation in terawatt-hours from wind sources. |
| Hydro Generation - TWh | This column represents electricity generation in terawatt-hours from hydroelectric sources. |
DATA PROCESSING
df = pd.read_csv('Data.csv')
# Import Country Classification per Income Bracket
mapper = pd.read_excel('Income_Bracket.xlsx',
sheet_name='Country Analytical History',
skiprows=range(11),
index_col=0,
header=None, usecols=[1, 34]).to_dict()[34]
country_list = ['Afghanistan', 'Norway', 'Mozambique', 'Myanmar', 'Namibia',
'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand',
'Nicaragua', 'Niger', 'Nigeria', 'North Macedonia',
'Northern Mariana Islands', 'Montenegro', 'Oman', 'Pakistan',
'Palau', 'Panama', 'Papua New Guinea', 'Paraguay', 'Peru',
'Philippines', 'Poland', 'Portugal', 'Morocco', 'Mongolia',
'Madagascar', 'Latvia', 'Lebanon', 'Lesotho', 'Liberia', 'Libya',
'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macao SAR, China',
'Malawi', 'Monaco', 'Malaysia', 'Maldives', 'Mali', 'Malta',
'Marshall Islands', 'Mauritania', 'Mauritius', 'Mexico',
'Micronesia, Fed. Sts.', 'Moldova', 'Puerto Rico', 'Qatar',
'Uganda', 'Switzerland', 'Syrian Arab Republic', 'Tajikistan',
'Tanzania', 'Thailand', 'Timor-Leste', 'Togo', 'Tonga',
'Trinidad and Tobago', 'Tunisia', 'Turkiye', 'Turkmenistan',
'Turks and Caicos Islands', 'Tuvalu', 'Ukraine', 'Romania',
'United Arab Emirates', 'United Kingdom', 'United States',
'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela, RB', 'Viet Nam',
'Virgin Islands (U.S.)', 'West Bank and Gaza', 'Yemen, Rep.',
'Zambia', 'Sweden', 'Suriname', 'Sudan', 'Russian Federation',
'Rwanda', 'Samoa', 'San Marino',
'Saudi Arabia', 'Senegal', 'Serbia', 'Seychelles', 'Sierra Leone',
'Singapore', 'Sint Maarten (Dutch part)', 'Slovak Republic',
'Slovenia', 'Solomon Islands', 'Somalia',
'South Africa', 'South Sudan', 'Spain', 'Sri Lanka',
'St. Kitts and Nevis', 'St. Lucia', 'St. Martin (French part)',
'St. Vincent and the Grenadines', 'Burkina Faso', 'Cabo Verde',
'Cambodia', 'Cameroon', 'Canada', 'Cayman Islands',
'Central African Republic',
'Chad', 'Channel Islands', 'Chile', 'China', 'Colombia', 'Comoros',
'Congo, Dem. Rep.', 'Congo, Rep.', 'Costa Rica', "Cote d'Ivoire",
'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czechia', 'Denmark',
'Djibouti', 'Dominica', 'Dominican Republic', 'Burundi', 'Bulgaria',
'Brunei Darussalam', 'Albania', 'Algeria', 'American Samoa',
'Andorra', 'Angola', 'Antigua and Barbuda', 'Argentina', 'Armenia',
'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas, The',
'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium', 'Belize',
'Benin', 'Bermuda', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina',
'Botswana', 'Brazil', 'British Virgin Islands', 'Ecuador', 'Guyana',
'Ireland', 'Honduras', 'Hong Kong SAR, China', 'Hungary', 'Iceland',
'India', 'Indonesia', 'Iran, Islamic Rep.', 'Iraq', 'Isle of Man',
'Haiti', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan',
'Kenya', 'Kiribati', "Korea, Dem. People's Rep.", 'Korea, Rep.',
'Kosovo', 'Kuwait', 'Kyrgyz Republic', 'Lao PDR', 'Zimbabwe',
'Egypt, Arab Rep.', 'Guinea-Bissau', 'El Salvador',
'Equatorial Guinea', 'Eritrea', 'Estonia', 'Eswatini', 'Ethiopia',
'Faroe Islands', 'Fiji', 'Finland', 'France', 'French Polynesia',
'Gabon', 'Gambia, The', 'Georgia', 'Germany', 'Ghana', 'Gibraltar',
'Greece', 'Greenland', 'Grenada', 'Guam', 'Guatemala', 'Guinea']
RETRIEVE COUNTRIES WITH AVAILABLE DATA ON ALL FEATURES
# EMISSION
df_emission = df[(df['Series Name'] == 'CO2 emissions (kt)') &
(df['2019 [YR2019]'] != '..') &
(df['2020 [YR2020]'] != '..')] \
.set_index('Country Name')
emission_countries = set(df_emission.index).intersection(set(country_list))
# GDP
df_gdp = df[(df['Series Name'] == 'GDP (constant 2015 US$)') &
(df['2019 [YR2019]'] != '..') &
(df['2020 [YR2020]'] != '..')] \
.set_index('Country Name')
gdp_countries = set(df_gdp.index).intersection(set(country_list))
# INTERSECTION OF AVAILABLE DATA
country_intersect = sorted(
list(emission_countries.intersection(gdp_countries)))
CREATE DATA SUBSETS
# Emission Datasets
df_emission = df_emission.loc[country_intersect]
df_emission_2019 = df_emission['2019 [YR2019]'].astype('float')
df_emission_2020 = df_emission['2020 [YR2020]'].astype('float')
# GDP Datasets
df_gdp = df_gdp.loc[country_intersect]
df_gdp_2019 = df_gdp['2019 [YR2019]'].astype(float)
df_gdp_2020 = df_gdp['2020 [YR2020]'].astype(float)
# Population Datasets
df_pop = df[(df['Series Name'] == 'Population, total') &
(df['2019 [YR2019]'] != '..')] \
.set_index('Country Name')
df_pop = df_pop.loc[list(country_intersect)]
df_pop_2019 = df_pop['2019 [YR2019]'].astype('float')
df_pop_2020 = df_pop['2020 [YR2020]'].astype('float')
# Emission per Capita Dataset
df_epc_2019 = pd.DataFrame(
{'CO2 per capita': (df_emission_2019/df_pop_2019*1000).to_list(),
'Income Bracket': [mapper[country] for country in df_emission_2019.index]},
index=df_emission_2019.index)
1. Load the energy dataset
df_energy = pd.read_csv('renewable.csv', usecols=lambda x: x not in ['Code'])
df_energy.head(3)
| | Entity | Year | Geo Biomass Other - TWh | Solar Generation - TWh | Wind Generation - TWh | Hydro Generation - TWh |
|---|---|---|---|---|---|---|
| 0 | Africa | 1971 | 0.164 | 0.0 | 0.0 | 26.013390 |
| 1 | Africa | 1972 | 0.165 | 0.0 | 0.0 | 29.633196 |
| 2 | Africa | 1973 | 0.170 | 0.0 | 0.0 | 31.345707 |
2. Get the countries of interest and 2019 and 2020 values
wanted_countries = ['Algeria', 'Argentina', 'Australia', 'Austria',
'Azerbaijan', 'Bangladesh', 'Belarus', 'Belgium',
'Brazil', 'Bulgaria', 'Canada', 'Chile', 'China',
'Colombia', 'Croatia', 'Cyprus', 'Czechia', 'Denmark',
'Ecuador', 'Egypt', 'Estonia', 'Finland', 'France',
'Germany', 'Greece', 'Hong Kong', 'Hungary', 'Iceland',
'India', 'Indonesia', 'Iran', 'Iraq', 'Ireland', 'Israel',
'Italy', 'Japan', 'Kazakhstan', 'Kuwait', 'Latvia',
'Lithuania', 'Luxembourg', 'Malaysia', 'Mexico', 'Morocco',
'Netherlands', 'New Zealand', 'North Macedonia', 'Norway',
'Oman', 'Pakistan', 'Peru', 'Philippines',
'Poland', 'Portugal', 'Qatar', 'Romania', 'Russia',
'Saudi Arabia', 'Singapore', 'Slovakia', 'Slovenia',
'South Africa', 'South Korea', 'Spain', 'Sri Lanka',
'Sweden', 'Switzerland', 'Taiwan', 'Thailand',
'Trinidad and Tobago', 'Turkey', 'Turkmenistan', 'Ukraine',
'United Arab Emirates', 'United Kingdom', 'United States',
'Uzbekistan', 'Venezuela', 'Vietnam']
# Only get the countries of interest
df_energy = df_energy[df_energy['Entity'].isin(wanted_countries)]
2019 Values
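# Get 2019 values only (copy so the new column is not written to a view)
df_energy_2019 = df_energy[df_energy['Year'] == 2019].copy()
# Sum the generation columns selected by position. With 'Code' dropped at load
# time, iloc[:, 3:7] covers Solar, Wind, and Hydro; 'Geo Biomass Other' sits at
# position 2 and is not part of this total.
df_energy_2019['Total Renewable Energy Consumption'] = (df_energy_2019
                                                        .iloc[:, 3:7]
                                                        .sum(axis=1))
df_energy_2019 = df_energy_2019.loc[:, ['Entity', 'Total Renewable Energy Consumption']]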
df_energy_2019.head(3)
| | Entity | Total Renewable Energy Consumption |
|---|---|---|
| 168 | Algeria | 0.777000 |
| 225 | Argentina | 33.300167 |
| 396 | Australia | 51.789451 |
2020 Values
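# Get 2020 values only (copy so the new column is not written to a view)
df_energy_2020 = df_energy[df_energy['Year'] == 2020].copy()
# Sum the generation columns selected by position. With 'Code' dropped at load
# time, iloc[:, 3:7] covers Solar, Wind, and Hydro; 'Geo Biomass Other' sits at
# position 2 and is not part of this total.
df_energy_2020['Total Renewable Energy Consumption'] = (df_energy_2020
                                                        .iloc[:, 3:7]
                                                        .sum(axis=1))
df_energy_2020 = df_energy_2020.loc[:, ['Entity', 'Total Renewable Energy Consumption']]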
df_energy_2020.head(3)
| | Entity | Total Renewable Energy Consumption |
|---|---|---|
| 169 | Algeria | 0.742300 |
| 226 | Argentina | 34.424309 |
| 397 | Australia | 60.870478 |
3. World Bank Data
df_wb = df.loc[:, ~df.columns.isin(['Series Code', 'Country Code'])]
df_wb.head(3)
| | Country Name | Series Name | 2019 [YR2019] | 2020 [YR2020] |
|---|---|---|---|---|
| 0 | Afghanistan | CO2 emissions (kt) | 11238.83 | 8709.47 |
| 1 | Afghanistan | GDP (constant 2015 US$) | 22071985906.2168 | 21553051296.9328 |
| 2 | Afghanistan | Population, total | 37769499 | 38972230 |
Get the countries of interest and 2019 values
# Only get the countries of interest
df_wb = df_wb[df_wb['Country Name'].isin(wanted_countries)]
df_wb.head(3)
| | Country Name | Series Name | 2019 [YR2019] | 2020 [YR2020] |
|---|---|---|---|---|
| 8 | Algeria | CO2 emissions (kt) | 170582.4 | 161563 |
| 9 | Algeria | GDP (constant 2015 US$) | 177355540257.925 | 168310407704.439 |
| 10 | Algeria | Population, total | 42705368 | 43451666 |
World Bank Data --- Merged
df_wb_co2 = df_wb[df_wb['Series Name'] == 'CO2 emissions (kt)']
df_wb_gdp = df_wb[df_wb['Series Name'] == 'GDP (constant 2015 US$)']
df_wb_pop = df_wb[df_wb['Series Name'] == 'Population, total']
# Rename the columns per dataframe
# CO2
df_wb_co2.rename(columns={'2019 [YR2019]': 'CO2 Emissions (2019)',
'2020 [YR2020]': 'CO2 Emissions (2020)'},
inplace=True)
df_wb_co2.drop('Series Name', axis=1, inplace=True)
# GDP
df_wb_gdp.rename(columns={'2019 [YR2019]': 'GDP (2019)',
'2020 [YR2020]': 'GDP (2020)'},
inplace=True)
df_wb_gdp.drop('Series Name', axis=1, inplace=True)
# Population
df_wb_pop.rename(columns={'2019 [YR2019]': 'Population (2019)',
'2020 [YR2020]': 'Population (2020)'},
inplace=True)
df_wb_pop.drop('Series Name', axis=1, inplace=True)
df_wb_merged = pd.merge(df_wb_co2, df_wb_gdp, on='Country Name', how='inner')
df_wb_merged = pd.merge(df_wb_merged, df_wb_pop, on='Country Name',
how='inner')
df_wb_merged.head(3)
| | Country Name | CO2 Emissions (2019) | CO2 Emissions (2020) | GDP (2019) | GDP (2020) | Population (2019) | Population (2020) |
|---|---|---|---|---|---|---|---|
| 0 | Algeria | 170582.4 | 161563 | 177355540257.925 | 168310407704.439 | 42705368 | 43451666 |
| 1 | Argentina | 168162 | 154535.9 | 571450737224.442 | 514630046744.607 | 44938712 | 45376763 |
| 2 | Australia | 395199.1 | 378996.8 | 1491740073728.97 | 1490980996778.96 | 25340217 | 25655289 |
World Bank Data --- 2020
df_wb_2019 = df_wb_merged[['Country Name', 'CO2 Emissions (2019)', 'GDP (2019)', 'Population (2019)']]
df_wb_2020 = df_wb_merged[['Country Name', 'CO2 Emissions (2020)', 'GDP (2020)', 'Population (2020)']]
df_wb_2020.head(3)
| | Country Name | CO2 Emissions (2020) | GDP (2020) | Population (2020) |
|---|---|---|---|---|
| 0 | Algeria | 161563 | 168310407704.439 | 43451666 |
| 1 | Argentina | 154535.9 | 514630046744.607 | 45376763 |
| 2 | Australia | 378996.8 | 1490980996778.96 | 25655289 |
4. Merge Renewable dataset and WB dataset
# For 2019: merge on country name, then keep only the numeric columns
df_all_2019 = pd.merge(df_wb_2019, df_energy_2019, left_on='Country Name',
                       right_on='Entity',
                       how='outer').drop(['Country Name', 'Entity'], axis=1)
# Drop countries that are missing from either dataset
df_all_2019.dropna(axis=0, how='any', inplace=True)
df_all_2019 = df_all_2019.astype(float)
df_all_2019.head(3)
# For 2020: same steps as for 2019
df_all_2020 = pd.merge(df_wb_2020, df_energy_2020, left_on='Country Name',
                       right_on='Entity',
                       how='outer').drop(['Country Name', 'Entity'], axis=1)
df_all_2020.dropna(axis=0, how='any', inplace=True)
df_all_2020 = df_all_2020.astype(float)
df_all_2020.head(3)
| | CO2 Emissions (2020) | GDP (2020) | Population (2020) | Total Renewable Energy Consumption |
|---|---|---|---|---|
| 0 | 161563.0 | 1.683104e+11 | 43451666.0 | 0.742300 |
| 1 | 154535.9 | 5.146300e+11 | 45376763.0 | 34.424309 |
| 2 | 378996.8 | 1.490981e+12 | 25655289.0 | 60.870478 |
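The merge above joins on country names, and the two sources do not always spell them the same way: the World Bank list earlier in the notebook uses 'Egypt, Arab Rep.' and 'Viet Nam', while the renewable list uses 'Egypt' and 'Vietnam'. One way to see which rows fail to match before `dropna` discards them is the `indicator` option of `pd.merge`. The sketch below uses hypothetical mini-frames with placeholder values, not the notebook's data.

```python
import pandas as pd

# Hypothetical mini-frames with placeholder values. Names on the left follow
# the World Bank spelling; names on the right follow the renewable dataset.
wb = pd.DataFrame({
    "Country Name": ["Algeria", "Egypt, Arab Rep.", "Viet Nam"],
    "GDP": [1.0, 2.0, 3.0],
})
energy = pd.DataFrame({
    "Entity": ["Algeria", "Egypt", "Vietnam"],
    "Renewables": [0.7, 20.0, 90.0],
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
audit = pd.merge(wb, energy, left_on="Country Name", right_on="Entity",
                 how="outer", indicator=True)

matched = audit[audit["_merge"] == "both"]
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["Country Name", "Entity", "_merge"]])
```

Any country that shows up as `left_only` or `right_only` would be dropped by the `dropna` step, so this kind of audit shows how many countries actually reach the regression.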
5. Data Loading and Pre-processing for the Isolated Regression Models
# For isolated regression later (2019)
df_all_2019_2 = pd.merge(
df_wb_2019, df_energy_2019, left_on="Country Name", right_on="Entity", how="outer"
).drop(["Entity"], axis=1)
df_all_2019_2.dropna(axis=0, how="any", inplace=True)
df_all_2019_2.iloc[:, 1:] = df_all_2019_2.iloc[:, 1:].astype(float)
# For isolated regression later (2020)
df_all_2020_2 = pd.merge(
df_wb_2020, df_energy_2020, left_on="Country Name", right_on="Entity", how="outer"
).drop(["Entity"], axis=1)
df_all_2020_2.dropna(axis=0, how="any", inplace=True)
df_all_2020_2.iloc[:, 1:] = df_all_2020_2.iloc[:, 1:].astype(float)
# What is the income bracket of each country?
mapper = pd.read_csv("mapper_df.csv")
# We only consider countries based on our wanted_countries variable
all_countries_labels = mapper[mapper["Country Name"].isin(wanted_countries)]
all_countries_labels.head(3)
# Include the Income Bracket column of the mapper to both df_all_2019_2 and df_all_2020_2
df_all_2019_2 = pd.merge(
df_all_2019_2, all_countries_labels, on="Country Name")
df_all_2020_2 = pd.merge(
df_all_2020_2, all_countries_labels, on="Country Name")
# Developed countries - 2019
developed_2019 = (
df_all_2019_2[df_all_2019_2["Income Bracket"] == "H"]
.reset_index(drop=True)
.drop(["Country Name", "Income Bracket"], axis=1)
)
developed_2019 = developed_2019.astype(float)
# Developing countries - 2019
developing_2019 = (
df_all_2019_2[df_all_2019_2["Income Bracket"].isin(["L", "LM", "UM"])]
.reset_index(drop=True)
.drop(["Country Name", "Income Bracket"], axis=1)
)
developing_2019 = developing_2019.astype(float)
# Developed countries - 2020
developed_2020 = (
df_all_2020_2[df_all_2020_2["Income Bracket"] == "H"]
.reset_index(drop=True)
.drop(["Country Name", "Income Bracket"], axis=1)
)
developed_2020 = developed_2020.astype(float)
# Developing countries - 2020
developing_2020 = (
df_all_2020_2[df_all_2020_2["Income Bracket"].isin(["L", "LM", "UM"])]
.reset_index(drop=True)
.drop(["Country Name", "Income Bracket"], axis=1)
)
developing_2020 = developing_2020.astype(float)
EXPLORATORY DATA ANALYSIS
# Load datasets
data_path = 'Data.csv'
metadata_path = 'SeriesMetadata.csv'
data = pd.read_csv(data_path, encoding='ascii', nrows=1064)
metadata = pd.read_csv(metadata_path, encoding='ISO-8859-1', nrows=1064)
SQLite
conn = sqlite3.connect("ACS-LT6.db")
# Update column names to remove spaces
data_updated = data.copy()
data_updated["2019 [YR2019]"] = pd.to_numeric(
data_updated["2019 [YR2019]"].replace("..", 0), errors="coerce"
).astype("float64")
data_updated["2020 [YR2020]"] = pd.to_numeric(
data_updated["2020 [YR2020]"].replace("..", 0), errors="coerce"
).astype("float64")
data_updated.columns = data_updated.columns.str.replace(" ", "")
# Extracting updated data for each table
country_data_updated = data_updated[[
"CountryName", "CountryCode"]].drop_duplicates()
series_data_updated = data_updated[[
"SeriesName", "SeriesCode"]].drop_duplicates()
# Rename '2019[YR2019]' to 'DataValue' for the 2019 table
year_2019_data_value = data_updated[[
"CountryCode", "SeriesCode", "2019[YR2019]"]]
year_2019_data_value.rename(
columns={"2019[YR2019]": "DataValue"}, inplace=True)
# Rename '2020[YR2020]' to 'DataValue' for the 2020 table
year_2020_data_value = data_updated[[
"CountryCode", "SeriesCode", "2020[YR2020]"]]
year_2020_data_value.rename(
columns={"2020[YR2020]": "DataValue"}, inplace=True)
# Write the updated data to SQL tables
country_data_updated.to_sql(
"countries", conn, if_exists="replace", index=False)
series_data_updated.to_sql("series", conn, if_exists="replace", index=False)
year_2019_data_value.to_sql("Y2019", conn, if_exists="replace", index=False)
year_2020_data_value.to_sql("Y2020", conn, if_exists="replace", index=False)
# Close the connection
conn.close()
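The cell above replaces the World Bank missing-value marker ".." with 0 before writing the tables. The sketch below, on made-up values, contrasts that choice with coercing the marker to NaN; it is meant only to show how the two choices affect summary statistics, not to restate what the notebook stores.

```python
import pandas as pd

raw = pd.Series(["10.0", "..", "30.0", ".."])  # made-up values

# Option 1: treat the marker as zero (as in the cell above).
as_zero = pd.to_numeric(raw.replace("..", 0), errors="coerce")

# Option 2: let unparseable entries become NaN.
as_nan = pd.to_numeric(raw, errors="coerce")

print("mean with zeros:", as_zero.mean())  # zeros count as observations
print("mean with NaN:  ", as_nan.mean())   # NaN entries are skipped by default
print("non-missing count:", as_nan.count())
```

With zeros, missing entries pull the mean and the lower quartiles toward 0; with NaN they are excluded from the calculation.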
# Connect to the SQLite database
conn = sqlite3.connect('ACS-LT6.db')
# Retrieve the list of tables in the database
tables_query = "SELECT name FROM sqlite_master WHERE type='table';"
tables = pd.read_sql_query(tables_query, conn)
table_names = tables['name'].tolist()
# Function to load data from a table and perform EDA
def perform_eda(table_name):
    # Load data into a DataFrame
    query = f"SELECT * FROM {table_name}"
    table_df = pd.read_sql_query(query, conn)
    # Basic Data Overview
    print(f"Data Overview for {table_name}:")
    print(table_df.info())
    print(table_df.head())
    # Descriptive Statistics
    print(f"\nDescriptive Statistics for {table_name}:")
    print(table_df.describe())
    return table_df


def display_histogram(table_name, data):
    # Visualization: Histograms for numeric data
    numeric_cols = data.select_dtypes(include=['float64', 'int64']).columns
    for col in numeric_cols:
        plt.figure(figsize=(8, 6))
        sns.histplot(data[col], kde=True)
        plt.title(f'Distribution of {col} in {table_name}')
        plt.xlabel(col)
        plt.ylabel('Frequency')
        plt.show()


def dsiplay_boxplots(table_name, data):
    # Visualization: Boxplots for numeric data
    numeric_cols = data.select_dtypes(include=['float64', 'int64']).columns
    for col in numeric_cols:
        plt.figure(figsize=(8, 6))
        sns.boxplot(data[col])
        plt.title(f'Boxplot of {col} in {table_name}')
        plt.xlabel(col)
        plt.show()


def dsiplay_corr(table_name, data):
    # Visualization: Correlation heatmap for numeric data
    numeric_cols = data.select_dtypes(include=['float64', 'int64']).columns
    # A correlation matrix needs at least two numeric columns
    if len(numeric_cols) > 1:
        plt.figure(figsize=(10, 8))
        sns.heatmap(data[numeric_cols].corr(), annot=True, cmap='coolwarm')
        plt.title(f'Correlation Matrix for {table_name}')
        plt.show()


# Perform EDA for each table in the database
for table in table_names:
    table_data = perform_eda(table)
    #display_histogram(table, table_data)
# Close the connection
conn.close()
Data Overview for countries:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CountryName 266 non-null object
1 CountryCode 266 non-null object
dtypes: object(2)
memory usage: 4.3+ KB
None
CountryName CountryCode
0 Afghanistan AFG
1 Albania ALB
2 Algeria DZA
3 American Samoa ASM
4 Andorra AND
Descriptive Statistics for countries:
CountryName CountryCode
count 266 266
unique 266 266
top Afghanistan AFG
freq 1 1
Data Overview for series:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SeriesName 4 non-null object
1 SeriesCode 4 non-null object
dtypes: object(2)
memory usage: 192.0+ bytes
None
SeriesName SeriesCode
0 CO2 emissions (kt) EN.ATM.CO2E.KT
1 GDP (constant 2015 US$) NY.GDP.MKTP.KD
2 Population, total SP.POP.TOTL
3 Urban land area (sq. km) AG.LND.TOTL.UR.K2
Descriptive Statistics for series:
SeriesName SeriesCode
count 4 4
unique 4 4
top CO2 emissions (kt) EN.ATM.CO2E.KT
freq 1 1
Data Overview for Y2019:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1064 entries, 0 to 1063
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CountryCode 1064 non-null object
1 SeriesCode 1064 non-null object
2 DataValue 1064 non-null float64
dtypes: float64(1), object(2)
memory usage: 25.1+ KB
None
CountryCode SeriesCode DataValue
0 AFG EN.ATM.CO2E.KT 1.123883e+04
1 AFG NY.GDP.MKTP.KD 2.207199e+10
2 AFG SP.POP.TOTL 3.776950e+07
3 AFG AG.LND.TOTL.UR.K2 0.000000e+00
4 ALB EN.ATM.CO2E.KT 4.993300e+03
Descriptive Statistics for Y2019:
DataValue
count 1.064000e+03
mean 6.639437e+11
std 4.685527e+12
min 0.000000e+00
25% 0.000000e+00
50% 3.921470e+05
75% 9.775513e+08
max 8.472015e+13
Data Overview for Y2020:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1064 entries, 0 to 1063
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CountryCode 1064 non-null object
1 SeriesCode 1064 non-null object
2 DataValue 1064 non-null float64
dtypes: float64(1), object(2)
memory usage: 25.1+ KB
None
CountryCode SeriesCode DataValue
0 AFG EN.ATM.CO2E.KT 8.709470e+03
1 AFG NY.GDP.MKTP.KD 2.155305e+10
2 AFG SP.POP.TOTL 3.897223e+07
3 AFG AG.LND.TOTL.UR.K2 0.000000e+00
4 ALB EN.ATM.CO2E.KT 4.383200e+03
Descriptive Statistics for Y2020:
DataValue
count 1.064000e+03
mean 6.457085e+11
std 4.550461e+12
min 0.000000e+00
25% 0.000000e+00
50% 3.810641e+05
75% 9.116047e+08
max 8.211736e+13
conn = sqlite3.connect('ACS-LT6.db')
# Query to extract GDP and CO2 emissions data for 2019
query_2019 = """
SELECT c.CountryCode, c.CountryName, s.SeriesName, s.SeriesCode, y.DataValue as DataValue2019
FROM Y2019 y
JOIN countries c ON y.CountryCode = c.CountryCode
JOIN series s ON y.SeriesCode = s.SeriesCode
WHERE s.SeriesName = 'GDP (constant 2015 US$)' OR s.SeriesName = 'CO2 emissions (kt)';
"""
data_2019 = pd.read_sql_query(query_2019, conn)
# Query to extract GDP and CO2 emissions data for 2020
query_2020 = """
SELECT c.CountryCode, c.CountryName, s.SeriesName, s.SeriesCode, y.DataValue as DataValue2020
FROM Y2020 y
JOIN countries c ON y.CountryCode = c.CountryCode
JOIN series s ON y.SeriesCode = s.SeriesCode
WHERE s.SeriesName = 'GDP (constant 2015 US$)' OR s.SeriesName = 'CO2 emissions (kt)';
"""
data_2020 = pd.read_sql_query(query_2020, conn)
# Display the extracted data for 2019 and 2020
data_2019.head(), data_2020.head()
conn.close()
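The same three-table join can be tried against an in-memory database, which is convenient for checking a query without touching ACS-LT6.db. The sketch below is an illustration under assumptions: it rebuilds tiny versions of the `countries`, `series`, and `Y2019` tables with two Afghanistan rows from the outputs above, and it passes the series names as query parameters through an `IN` clause instead of writing them into the SQL string.

```python
from contextlib import closing
import sqlite3

import pandas as pd

co2_code, pop_code = "EN.ATM.CO2E.KT", "SP.POP.TOTL"

# closing() guarantees the connection is closed even if a query fails.
with closing(sqlite3.connect(":memory:")) as conn:
    pd.DataFrame(
        {"CountryCode": ["AFG"], "CountryName": ["Afghanistan"]}
    ).to_sql("countries", conn, index=False)
    pd.DataFrame(
        {"SeriesCode": [co2_code, pop_code],
         "SeriesName": ["CO2 emissions (kt)", "Population, total"]}
    ).to_sql("series", conn, index=False)
    pd.DataFrame(
        {"CountryCode": ["AFG", "AFG"],
         "SeriesCode": [co2_code, pop_code],
         "DataValue": [11238.83, 37769499.0]}
    ).to_sql("Y2019", conn, index=False)

    query = """
        SELECT c.CountryName, s.SeriesName, y.DataValue
        FROM Y2019 AS y
        JOIN countries AS c ON y.CountryCode = c.CountryCode
        JOIN series AS s ON y.SeriesCode = s.SeriesCode
        WHERE s.SeriesName IN (?, ?)
    """
    wanted = ("GDP (constant 2015 US$)", "CO2 emissions (kt)")
    result = pd.read_sql_query(query, conn, params=wanted)

print(result)
```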
# Preparing data for OLS regression: GDP as the dependent variable and CO2 emissions as the independent variable
# Filtering and merging data for GDP and CO2 emissions
gdp_data_2019 = data_2019[data_2019['SeriesName'] == 'GDP (constant 2015 US$)']
co2_data_2019 = data_2019[data_2019['SeriesName'] == 'CO2 emissions (kt)']
merged_data_2019 = pd.merge(gdp_data_2019, co2_data_2019, on='CountryCode', suffixes=('_GDP', '_CO2'))
# Filtering and merging data for 2020
gdp_data_2020 = data_2020[data_2020['SeriesName'] == 'GDP (constant 2015 US$)']
co2_data_2020 = data_2020[data_2020['SeriesName'] == 'CO2 emissions (kt)']
merged_data_2020 = pd.merge(gdp_data_2020, co2_data_2020, on='CountryCode', suffixes=('_GDP', '_CO2'))
# Performing OLS for 2019
Y_2019 = merged_data_2019['DataValue2019_GDP']
X_2019 = merged_data_2019['DataValue2019_CO2']
X_2019 = sm.add_constant(X_2019) # Adds a constant term to the predictor
# Fit the OLS model for 2019
model_2019 = sm.OLS(Y_2019, X_2019).fit()
# Performing OLS for 2020
Y_2020 = merged_data_2020['DataValue2020_GDP']
X_2020 = merged_data_2020['DataValue2020_CO2']
X_2020 = sm.add_constant(X_2020) # Adds a constant term to the predictor
# Fit the OLS model for 2020
model_2020 = sm.OLS(Y_2020, X_2020).fit()
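With a single predictor, the OLS fit has a closed form, which can be handy for sanity-checking the statsmodels result. This is a minimal sketch on made-up numbers (not the study's data); `simple_ols`, `x_toy`, and `y_toy` are hypothetical names, with `x_toy` playing the role of CO2 emissions and `y_toy` the role of GDP.

```python
from statistics import mean


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-products and sum of squares around the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical toy values
x_toy = [1.0, 2.0, 3.0, 4.0]
y_toy = [2.0, 4.0, 5.0, 8.0]
toy_intercept, toy_slope = simple_ols(x_toy, y_toy)
print(f"intercept={toy_intercept:.3f}, slope={toy_slope:.3f}")
```

The same slope and intercept should appear in `model.params` when the toy data are passed through `sm.OLS` with `sm.add_constant`.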
plt.figure(figsize=(10, 6))
plt.hist(merged_data_2019['DataValue2019_GDP'], bins=20, color='red', alpha=0.7)
plt.title('Histogram of GDP for 2019')
plt.xlabel('GDP (constant 2015 US$)')
plt.ylabel('Frequency')
plt.show()
plt.figure(figsize=(10, 6))
plt.hist(merged_data_2019['DataValue2019_CO2'], bins=20, color='blue', alpha=0.7)
plt.title('Histogram of CO2 Emissions for 2019')
plt.xlabel('CO2 Emissions (kt)')
plt.ylabel('Frequency')
plt.show()
plt.figure(figsize=(10, 6))
sns.boxplot(data=[merged_data_2019['DataValue2019_CO2'], merged_data_2020['DataValue2020_CO2']], palette=['blue', 'green'])
plt.title('Box Plot of CO2 Emissions for 2019 and 2020')
plt.xticks([0, 1], ['2019', '2020'])
plt.ylabel('CO2 Emissions (kt)')
plt.show()
plt.figure(figsize=(10, 6))
plt.hist(merged_data_2020['DataValue2020_CO2'], bins=20, color='green', alpha=0.7)
plt.title('Histogram of CO2 Emissions for 2020')
plt.xlabel('CO2 Emissions (kt)')
plt.ylabel('Frequency')
plt.show()
plt.figure(figsize=(10, 6))
sns.boxplot(data=[merged_data_2019['DataValue2019_GDP'], merged_data_2020['DataValue2020_GDP']], palette=['red', 'purple'])
plt.title('Box Plot of GDP for 2019 and 2020')
plt.xticks([0, 1], ['2019', '2020'])
plt.ylabel('GDP (constant 2015 US$)')
plt.show()
STATISTICAL ANALYSIS
"EARTH IS HEALING"
During the pandemic, a popular narrative emerged in news outlets and social media: "Earth is healing". To test that narrative in terms of CO2 emissions, we can conduct a paired-observation t-test on the 2019 and 2020 data.
One-Tailed t-test on Paired Observations
$\alpha = 0.05$
NULL HYPOTHESIS
$\mu_{D} = d_{0}$
There is no significant difference in CO2 emissions between 2019 and 2020. Here $\mu_{D}$ is the mean of the paired differences (2019 minus 2020) and $d_{0} = 0$.
ALTERNATIVE HYPOTHESIS
$\mu_{D} > d_{0}$
There is a significant decrease in CO2 emissions between 2019 and 2020.
result = stats.ttest_rel(df_emission_2019,
df_emission_2020,
alternative='greater')
print(f'The paired observation t-score is: {result.statistic:.4f}')
print(f'The p-value is: {result.pvalue:.4f}')
The paired observation t-score is: 2.4731
The p-value is: 0.0072
We reject the null hypothesis that there is no difference in CO2 emissions between 2019 and 2020 (p = 0.0072, α = 0.05). Therefore, there is sufficient evidence of a significant decrease in CO2 emissions from 2019 to 2020, during the pandemic.
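A paired t-test is a one-sample t-test on the per-country differences, so the statistic can be reproduced by hand. The sketch below uses made-up values rather than the study's emissions series, and `paired_t_statistic` is a hypothetical helper. It returns only the t-score; the one-tailed p-value comes from the t distribution with n - 1 degrees of freedom, which `stats.ttest_rel` handles.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """t = mean(d) / (sd(d) / sqrt(n)), with d = before - after for each pair."""
    diffs = [b - a for b, a in zip(before, after)]
    standard_error = stdev(diffs) / sqrt(len(diffs))  # stdev uses n - 1
    return mean(diffs) / standard_error


# Hypothetical emissions for five countries in two consecutive years
emissions_before = [10.0, 12.0, 9.0, 15.0, 11.0]
emissions_after = [9.0, 11.0, 9.0, 13.0, 10.0]
toy_t = paired_t_statistic(emissions_before, emissions_after)
print(f"t = {toy_t:.4f}")
```

A positive t-score means the first year's values are larger on average, which is why `alternative='greater'` corresponds to a decrease from 2019 to 2020.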
ECONOMIC IMPACT OF COVID ON GDP
Given the significant evidence that CO2 emissions fell during the pandemic period, it is worth investigating whether GDP declined as well. Since we are examining the relationship between GDP and CO2 emissions, we would expect both to decline if they are positively correlated.
One-Tailed t-test on Paired Observations
$\alpha = 0.05$
NULL HYPOTHESIS
$\mu_{D} = d_{0}$
There is no significant difference in GDP between 2019 and 2020. As before, $\mu_{D}$ is the mean paired difference (2019 minus 2020) and $d_{0} = 0$.
ALTERNATIVE HYPOTHESIS
$\mu_{D} > d_{0}$
There is a significant decrease in GDP between 2019 and 2020.
result = stats.ttest_rel(df_gdp_2019,
df_gdp_2020,
alternative='greater')
print(f'The paired observation t-score is: {result.statistic:.2f}')
print(f'The p-value is: {result.pvalue:.4f}')
The paired observation t-score is: 3.02
The p-value is: 0.0014
We reject the null hypothesis that there is no difference in GDP of countries between 2019 and 2020 (p = 0.0014, α = 0.05). Therefore, there is sufficient evidence to suggest that there is a significant decrease in GDP of countries from 2019 to 2020, during the pandemic.
Since both CO2 emissions and GDP decreased during the pandemic period, this lends some support to the assumption that they are positively correlated. However, there could be latent factors at play. The Covid-19 situation is unique, and the behavior of GDP and CO2 emissions during this period may not be consistent with their normal patterns outside a pandemic. Further investigation can reveal more about the relationship between GDP and CO2 emissions.
CO2 EMISSIONS OF COUNTRIES PER INCOME BRACKET
Financing is a key aspect of the Paris Agreement on Climate Change. Developed countries are expected to provide financial assistance to developing countries, which are the most affected by the severe consequences of climate change. This provision rests on one key assumption: that developed countries are the greatest contributors to greenhouse gas emissions. To test this, we can perform an ANOVA on CO2 emissions per capita to check whether there are significant differences among income brackets.
INCOME BRACKETS:
| Category | GNI per capita (US$) | Examples |
|---|---|---|
| HIGH | $\gt$ 12,535 | United States, Singapore, Switzerland |
| UPPER-MIDDLE | 4,046 - 12,535 | Mexico, Thailand, China |
| LOWER-MIDDLE | 1,036 - 4,045 | Philippines, India, Ukraine |
| LOW | $\lt$ 1,036 | Ethiopia, Syria, Afghanistan |
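The bracket boundaries can be expressed as a small lookup function. This is a sketch under two assumptions: the thresholds are GNI per capita in US dollars, and the low-income band is everything below 1,036 (the point where the lower-middle band starts). `income_bracket` is a hypothetical helper, not part of the notebook's pipeline.

```python
def income_bracket(gni_per_capita):
    """Map GNI per capita (US$) to the income bracket labels used in the table."""
    if gni_per_capita > 12_535:
        return "HIGH"
    if gni_per_capita >= 4_046:
        return "UPPER-MIDDLE"
    if gni_per_capita >= 1_036:
        return "LOWER-MIDDLE"
    return "LOW"  # assumption: anything below 1,036


# Hypothetical GNI per capita values, chosen to sit on the boundaries
for value in (500, 1_036, 4_045, 4_046, 12_535, 12_536):
    print(value, income_bracket(value))
```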
One-Way ANOVA
$\alpha = 0.05$
NULL HYPOTHESIS
$\mu_{h} = \mu_{um} = \mu_{lm} = \mu_{l}$
There is no significant difference in CO2 emissions per capita among the four income groups.
ALTERNATIVE HYPOTHESIS
Not all means are equal.
At least one of the income groups has a significantly different CO2 emission per capita compared to the other income groups.
# Function to select random samples within each group
def sample_from_group(group, num_samples):
    # Sample without replacement; keep the whole group if it is too small
    return (group.sample(n=num_samples, replace=False, random_state=23)
            if len(group) >= num_samples
            else group)

# Apply the function to create equally sized sets of random samples
random_samples = df_epc_2019.groupby(
    'Income Bracket', group_keys=False).apply(sample_from_group, num_samples=20)
# ANOVA using pingouin
# Perform one-way ANOVA
anova_result = pg.anova(data=random_samples,
                        dv='CO2 per capita', between='Income Bracket')
# Print results
print(anova_result, '\n')
# Interpret results
if anova_result['p-unc'].iloc[0] < 0.05:
    print('Reject the null hypothesis. There is a significant '
          'difference in at least one group.')
else:
    print('Fail to reject the null hypothesis. '
          'There is no significant difference in group means.')
# anova_result
           Source  ddof1  ddof2          F         p-unc       np2
0  Income Bracket      3     76  35.228643  2.236668e-14  0.581696
Reject the null hypothesis. There is a significant difference in at least one group.
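The F statistic in the ANOVA table is the ratio of between-group to within-group mean squares, and in a one-way design the `np2` column equals the between-group share of the total sum of squares. The degrees of freedom above (3 and 76) follow from 4 groups and 80 sampled countries. The sketch below reproduces the calculation on made-up groups (not the study's samples); `one_way_anova` is a hypothetical helper.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_squared


# Hypothetical per-capita emissions for three small groups
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
toy_f, toy_df1, toy_df2, toy_eta = one_way_anova(toy_groups)
print(f"F({toy_df1}, {toy_df2}) = {toy_f:.3f}, eta squared = {toy_eta:.4f}")
```

The same identity links the reported values: F × ddof1 / (F × ddof1 + ddof2) gives the reported `np2` of about 0.58.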
# Tukey HSD post-hoc test: pairwise comparison of CO2 per capita across income brackets
tukey_results = pairwise_tukeyhsd(
random_samples['CO2 per capita'], random_samples['Income Bracket'])
print(tukey_results)
Multiple Comparison of Means - Tukey HSD, FWER=0.05
=====================================================
group1 group2 meandiff p-adj lower upper reject
-----------------------------------------------------
H L -8.0927 0.0 -10.3371 -5.8483 True
H LM -6.9957 0.0 -9.2402 -4.7513 True
H UM -4.8194 0.0 -7.0638 -2.575 True
L LM 1.0969 0.576 -1.1475 3.3414 False
L UM 3.2733 0.0015 1.0289 5.5177 True
LM UM 2.1764 0.0607 -0.0681 4.4208 False
-----------------------------------------------------
The results show 4 significant pairwise differences in the means. Consecutive brackets show no significant difference, except for the pair involving the High Income bracket. High-income countries have significantly greater CO2 emissions per capita than the rest of the world. This supports the basis of the Paris Agreement's financial aid provision.
| INCOME BRACKET PAIR | SIGNIFICANT DIFFERENCE |
|---|---|
| HIGH - UPPER MIDDLE | YES |
| UPPER MIDDLE - LOWER MIDDLE | NO |
| LOWER MIDDLE - LOW | NO |
| NON-CONSECUTIVE PAIRS (e.g. HIGH - LOW) | YES |
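One way to read the Tukey HSD output is through its confidence intervals: a pair differs significantly when the interval for the mean difference excludes zero. The sketch below applies that rule to the lower and upper bounds copied from the printed table; the variable names are illustrative.

```python
# (group1, group2, lower, upper) copied from the Tukey HSD output above
tukey_rows = [
    ("H", "L", -10.3371, -5.8483),
    ("H", "LM", -9.2402, -4.7513),
    ("H", "UM", -7.0638, -2.575),
    ("L", "LM", -1.1475, 3.3414),
    ("L", "UM", 1.0289, 5.5177),
    ("LM", "UM", -0.0681, 4.4208),
]

# A pair is significant when its confidence interval does not contain zero
significant_pairs = [
    (g1, g2) for g1, g2, lower, upper in tukey_rows if lower > 0 or upper < 0
]
print(len(significant_pairs), significant_pairs)
```

This recovers the four pairs flagged `True` in the `reject` column; the LM - UM interval only narrowly includes zero, consistent with its adjusted p-value of 0.0607.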
Regression Analysis
FUNCTIONS for Class Pipeline
class RegressionAnalysis:
    """OLS helper: expects the response in column 0 and the predictors after it."""

    def __init__(self, df):
        self.X = df.iloc[:, 1:].astype(float)
        self.y = df.iloc[:, 0].astype(float)
        self.model = None

    def fit_regression(self):
        # Add the constant (alpha) to the regression model
        self.X = sm.add_constant(self.X)
        # Fit the regression model using OLS (Ordinary Least Squares)
        self.model = sm.OLS(self.y, self.X).fit()
        # Print the model summary statistics
        print(self.model.summary(), end="\n\n")
        # Get and print the coefficients
        coefficients = self.model.params
        print("=" * 100)
        print("Coefficients")
        print(coefficients)

    def plots(self):
        # Create subplots
        fig, axs = plt.subplots(1, 3, figsize=(18, 6))
        # Residuals vs Fitted Values plot
        axs[0].scatter(self.model.fittedvalues, self.model.resid)
        axs[0].set_xlabel("Fitted Values")
        axs[0].set_ylabel("Residuals")
        axs[0].set_title("Residuals vs Fitted Values")
        # Observed vs Fitted Values plot
        axs[1].scatter(self.model.fittedvalues, self.y)
        axs[1].set_xlabel("Fitted Values")
        axs[1].set_ylabel("Observed")
        axs[1].set_title("Observed vs Fitted Values")
        # Normal Q-Q plot
        sm.qqplot(self.model.resid, line="r", ax=axs[2])
        axs[2].set_title("Normal Q-Q Plot")
        # Adjust layout
        plt.tight_layout()
        # Show the combined figure
        plt.show()

    def stepwise_selection(self):
        # Forward selection: repeatedly add the candidate with the smallest
        # p-value until no remaining candidate is significant at the 5% level.
        # Note: after fit_regression(), self.X already contains 'const',
        # so the constant is also treated as a candidate here.
        included = []
        while True:
            excluded = list(set(self.X.columns) - set(included))
            new_pval = pd.Series(index=excluded, dtype=float)
            for new_column in excluded:
                # add_constant leaves the data unchanged when a constant
                # column is already present (default has_constant='skip')
                model = sm.OLS(
                    self.y,
                    sm.add_constant(pd.DataFrame(
                        self.X[included + [new_column]])),
                ).fit()
                new_pval[new_column] = model.pvalues[new_column]
            best_pval = new_pval.min()
            if best_pval < 0.05:
                best_feature = new_pval.idxmin()
                included.append(best_feature)
            else:
                break
        print("=" * 149, end="\n")
        print("Results of the stepwise selection:", included, end="\n\n")

    def check_multicollinearity(self):
        # Variance inflation factor for every column of the design matrix
        vif_data = pd.DataFrame()
        vif_data["Variable"] = self.X.columns
        vif_data["VIF"] = [
            variance_inflation_factor(self.X.values, i)
            for i in range(self.X.shape[1])
        ]
        # Check for variables with high VIF
        print("=" * 117)
        print("Results of VIF:\n", vif_data)
Full Model Regression
This involves all countries across 2019 and 2020.
Full Model Regression (2019)
# Full model for 2019 involving all countries
reg_2019_all = RegressionAnalysis(df_all_2019)
reg_2019_all.fit_regression()
reg_2019_all.plots()
reg_2019_all.stepwise_selection()
reg_2019_all.check_multicollinearity()
OLS Regression Results
================================================================================
Dep. Variable: CO2 Emissions (2019) R-squared: 0.955
Model: OLS Adj. R-squared: 0.953
Method: Least Squares F-statistic: 460.4
Date: Wed, 06 Dec 2023 Prob (F-statistic): 1.03e-43
Time: 02:14:28 Log-Likelihood: -967.88
No. Observations: 69 AIC: 1944.
Df Residuals: 65 BIC: 1953.
Df Model: 3
Covariance Type: nonrobust
======================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------------
const -8.54e+04 4.01e+04 -2.130 0.037 -1.65e+05 -5333.787
GDP (2019) 1.135e-07 2.06e-08 5.508 0.000 7.23e-08 1.55e-07
Population (2019) 0.0013 0.000 5.549 0.000 0.001 0.002
Total Renewable Energy Consumption 3415.3423 306.316 11.150 0.000 2803.588 4027.097
==============================================================================
Omnibus: 57.932 Durbin-Watson: 2.108
Prob(Omnibus): 0.000 Jarque-Bera (JB): 389.392
Skew: -2.317 Prob(JB): 2.78e-85
Kurtosis: 13.675 Cond. No. 3.40e+12
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.4e+12. This might indicate that there are
strong multicollinearity or other numerical problems.
====================================================================================================
Coefficients
const -8.539921e+04
GDP (2019) 1.134794e-07
Population (2019) 1.329198e-03
Total Renewable Energy Consumption 3.415342e+03
dtype: float64
=====================================================================================================================================================
Results of the stepwise selection: ['Total Renewable Energy Consumption', 'Population (2019)', 'GDP (2019)', 'const']
=====================================================================================================================
Results of VIF:
Variable VIF
0 const 1.168443
1 GDP (2019) 2.682763
2 Population (2019) 2.320713
3 Total Renewable Energy Consumption 4.217482
Full Model Regression (2020)
# Full model for 2020 involving all countries
reg_2020_all = RegressionAnalysis(df_all_2020)
reg_2020_all.fit_regression()
reg_2020_all.plots()
reg_2020_all.stepwise_selection()
reg_2020_all.check_multicollinearity()
OLS Regression Results
================================================================================
Dep. Variable: CO2 Emissions (2020) R-squared: 0.953
Model: OLS Adj. R-squared: 0.950
Method: Least Squares F-statistic: 435.0
Date: Wed, 06 Dec 2023 Prob (F-statistic): 6.03e-43
Time: 02:14:28 Log-Likelihood: -969.46
No. Observations: 69 AIC: 1947.
Df Residuals: 65 BIC: 1956.
Df Model: 3
Covariance Type: nonrobust
======================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------------
const -8.535e+04 4.09e+04 -2.084 0.041 -1.67e+05 -3568.567
GDP (2020) 7.922e-08 2.23e-08 3.547 0.001 3.46e-08 1.24e-07
Population (2020) 0.0012 0.000 4.824 0.000 0.001 0.002
Total Renewable Energy Consumption 3588.6322 301.187 11.915 0.000 2987.121 4190.144
==============================================================================
Omnibus: 58.260 Durbin-Watson: 2.102
Prob(Omnibus): 0.000 Jarque-Bera (JB): 381.445
Skew: -2.352 Prob(JB): 1.48e-83
Kurtosis: 13.514 Cond. No. 3.34e+12
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.34e+12. This might indicate that there are
strong multicollinearity or other numerical problems.
====================================================================================================
Coefficients
const -8.534701e+04
GDP (2020) 7.921794e-08
Population (2020) 1.170253e-03
Total Renewable Energy Consumption 3.588632e+03
dtype: float64
=====================================================================================================================================================
Results of the stepwise selection: ['Total Renewable Energy Consumption', 'Population (2020)', 'GDP (2020)', 'const']
=====================================================================================================================
Results of VIF:
Variable VIF
0 const 1.164342
1 GDP (2020) 2.930309
2 Population (2020) 2.301857
3 Total Renewable Energy Consumption 4.493000
Interpretation of the Full Model Regression Analysis
Using a 5% significance level, all the predictor variables in the full model involving all countries contribute significantly to explaining the variability of CO2 emissions. Moreover, the predictor variables are only moderately correlated with each other in both years (all VIFs are below 5). For both years, as well as in the succeeding regression models, a constant (intercept) was added to the model to absorb the average effect of factors that contribute to CO2 emissions but were not included in the regression analysis.
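The VIF for predictor j is 1 / (1 - R_j²), where R_j² comes from regressing that predictor on the remaining ones. With only two predictors, R_j² reduces to the squared Pearson correlation between them, which makes the idea easy to check by hand. The sketch below uses made-up predictor values; `pearson_r` and `vif_two_predictors` are hypothetical helpers, not the statsmodels implementation.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical predictor values with a correlation of 0.8
toy_x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
toy_vif = vif_two_predictors(toy_x1, toy_x2)
print(f"r = {pearson_r(toy_x1, toy_x2):.2f}, VIF = {toy_vif:.3f}")
```

Common rules of thumb treat a VIF above 5 or 10 as a sign of problematic collinearity; the exact cutoff is a convention rather than a formal test.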
Isolated Regression
This section provides a preliminary investigation of the Environmental Kuznets Curve (EKC) Theory, a hypothesis that suggests that environmental degradation initially increases when economic expansion occurs. However, at a certain point, a society starts to improve its relationship with the environment and environmental degradation levels start to decline. This phenomenon is best described by Figure 6.
In the context of the EKC, the relationship between economic development and environmental degradation is analyzed longitudinally. As such, cointegration tests are used to determine the long-term relationship between the two variables. This is in contrast to correlation methods, which focus on a much shorter timeframe. Given the limited scope of our discussion, a correlation analysis through regression was used to make an initial deep dive into the EKC theory.
We run separate regressions on developing countries and developed countries. We surmise that the EKC is present in a particular year if the slope of the regression line for the developing countries is steeper than that for the developed countries.
Developing vs. Developed
The mapper file contains countries and their designated income bracket:
L: Low income
LM: Low-middle income
UM: Upper-middle income
H: High income
In this study, developing countries include low-income, low-middle, and upper-middle income countries, while developed countries include high-income countries.
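The grouping rule can be written as a simple lookup from bracket code to development status. The bracket codes come from the mapper description above; the country codes and the `split_by_development` helper are hypothetical and only illustrate the split.

```python
# Bracket codes from the mapper file, mapped to the two groups used in this study
BRACKET_TO_GROUP = {
    "L": "developing",
    "LM": "developing",
    "UM": "developing",
    "H": "developed",
}


def split_by_development(country_brackets):
    """Split a {country: bracket} mapping into developing and developed lists."""
    groups = {"developing": [], "developed": []}
    for country, bracket in country_brackets.items():
        groups[BRACKET_TO_GROUP[bracket]].append(country)
    return groups


# Hypothetical country codes and brackets
toy_mapper = {"AAA": "H", "BBB": "UM", "CCC": "LM", "DDD": "L"}
toy_groups_by_status = split_by_development(toy_mapper)
print(toy_groups_by_status)
```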
Regression Analysis of Developed Countries (2019 and 2020)
# Regression analysis for the developed countries (2019)
reg_developed_2019 = RegressionAnalysis(developed_2019)
reg_developed_2019.fit_regression()
reg_developed_2019.plots()
reg_developed_2019.stepwise_selection()
reg_developed_2019.check_multicollinearity()
OLS Regression Results
================================================================================
Dep. Variable: CO2 Emissions (2019) R-squared: 0.975
Model: OLS Adj. R-squared: 0.973
Method: Least Squares F-statistic: 501.5
Date: Wed, 06 Dec 2023 Prob (F-statistic): 1.37e-30
Time: 02:14:29 Log-Likelihood: -549.59
No. Observations: 42 AIC: 1107.
Df Residuals: 38 BIC: 1114.
Df Model: 3
Covariance Type: nonrobust
======================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------------
const -1.072e+04 2.4e+04 -0.447 0.658 -5.93e+04 3.79e+04
GDP (2019) 2.406e-07 3.14e-08 7.662 0.000 1.77e-07 3.04e-07
Population (2019) -0.0007 0.002 -0.407 0.686 -0.004 0.003
Total Renewable Energy Consumption 215.3535 319.536 0.674 0.504 -431.513 862.220
==============================================================================
Omnibus: 11.215 Durbin-Watson: 1.951
Prob(Omnibus): 0.004 Jarque-Bera (JB): 29.598
Skew: -0.277 Prob(JB): 3.74e-07
Kurtosis: 7.075 Cond. No. 4.21e+12
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.21e+12. This might indicate that there are
strong multicollinearity or other numerical problems.
====================================================================================================
Coefficients
const -1.072026e+04
GDP (2019) 2.406131e-07
Population (2019) -6.928047e-04
Total Renewable Energy Consumption 2.153535e+02
dtype: float64
=====================================================================================================================================================
Results of the stepwise selection: ['GDP (2019)']
=====================================================================================================================
Results of VIF:
Variable VIF
0 const 1.610832
1 GDP (2019) 26.547516
2 Population (2019) 23.569251
3 Total Renewable Energy Consumption 4.354519
# Regression analysis for the developed countries (2020)
reg_developed_2020 = RegressionAnalysis(developed_2020)
reg_developed_2020.fit_regression()
reg_developed_2020.plots()
reg_developed_2020.stepwise_selection()
reg_developed_2020.check_multicollinearity()
OLS Regression Results
================================================================================
Dep. Variable: CO2 Emissions (2020) R-squared: 0.976
Model: OLS Adj. R-squared: 0.974
Method: Least Squares F-statistic: 508.6
Date: Wed, 06 Dec 2023 Prob (F-statistic): 1.05e-30
Time: 02:14:30 Log-Likelihood: -544.84
No. Observations: 42 AIC: 1098.
Df Residuals: 38 BIC: 1105.
Df Model: 3
Covariance Type: nonrobust
======================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------------
const -5820.4427 2.15e+04 -0.271 0.788 -4.94e+04 3.77e+04
GDP (2020) 2.141e-07 2.85e-08 7.518 0.000 1.56e-07 2.72e-07
Population (2020) 1.603e-05 0.001 0.011 0.991 -0.003 0.003
Total Renewable Energy Consumption 124.0802 278.209 0.446 0.658 -439.124 687.285
==============================================================================
Omnibus: 10.671 Durbin-Watson: 1.935
Prob(Omnibus): 0.005 Jarque-Bera (JB): 28.324
Skew: 0.176 Prob(JB): 7.07e-07
Kurtosis: 7.008 Cond. No. 4.09e+12
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.09e+12. This might indicate that there are
strong multicollinearity or other numerical problems.
====================================================================================================
Coefficients
const -5.820443e+03
GDP (2020) 2.140910e-07
Population (2020) 1.602640e-05
Total Renewable Energy Consumption 1.240802e+02
dtype: float64
=====================================================================================================================================================
Results of the stepwise selection: ['GDP (2020)']
=====================================================================================================================
Results of VIF:
Variable VIF
0 const 1.621499
1 GDP (2020) 25.742180
2 Population (2020) 22.567237
3 Total Renewable Energy Consumption 4.782371
Interpretation
In the regression models for developed countries, only GDP contributes significantly to explaining the variance in CO2 emissions; the models have an R-squared of 97.5% in 2019 and 97.6% in 2020. Population and total renewable energy consumption are not significant in either year (p > 0.5), implying that these variables add no meaningful information and instead introduce noise to the model. The high VIFs for GDP and population (above 20) also indicate that the two are strongly collinear within this group.
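Adjusted R-squared penalizes R-squared for the number of predictors, which is why it is the better figure to watch when judging whether extra variables add information. The sketch below applies the standard formula to the rounded R-squared values and observation counts printed in the summaries above; because the inputs are rounded, the results are only expected to agree with the reported values to about three decimals. `adjusted_r_squared` is a hypothetical helper.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# (R-squared, observations, predictors) as printed in the OLS summaries
reported_fits = {
    "all countries (2019)": (0.955, 69, 3),
    "developed (2019)": (0.975, 42, 3),
    "developed (2020)": (0.976, 42, 3),
}
adjusted = {
    name: adjusted_r_squared(*values) for name, values in reported_fits.items()
}
for name, value in adjusted.items():
    print(f"{name}: adjusted R-squared = {value:.3f}")
```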
Regression Analysis of Developing Countries (2019 and 2020)
# Regression analysis for the developing countries (2019)
reg_developing_2019 = RegressionAnalysis(developing_2019)
reg_developing_2019.fit_regression()
reg_developing_2019.plots()
reg_developing_2019.stepwise_selection()
reg_developing_2019.check_multicollinearity()
OLS Regression Results
================================================================================
Dep. Variable: CO2 Emissions (2019) R-squared: 0.991
Model: OLS Adj. R-squared: 0.990
Method: Least Squares F-statistic: 877.0
Date: Wed, 06 Dec 2023 Prob (F-statistic): 7.58e-24
Time: 02:14:30 Log-Likelihood: -366.39
No. Observations: 27 AIC: 740.8
Df Residuals: 23 BIC: 746.0
Df Model: 3
Covariance Type: nonrobust
======================================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------------------------
const -1.116e+05 4.43e+04 -2.518 0.019 -2.03e+05 -1.99e+04
GDP (2019) 9.485e-07 1.06e-07 8.952 0.000 7.29e-07 1.17e-06
Population (2019) 0.0003 0.000 1.407 0.173 -0.000 0.001
Total Renewable Energy Consumption -1679.9926 724.665 -2.318 0.030 -3179.077 -180.908
==============================================================================
Omnibus: 16.476 Durbin-Watson: 2.426
Prob(Omnibus): 0.000 Jarque-Bera (JB): 17.481
Skew: -1.660 Prob(JB): 0.000160
Kurtosis: 5.125 Cond. No. 3.20e+12
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.2e+12. This might indicate that there are
strong multicollinearity or other numerical problems.
====================================================================================================
Coefficients
const -1.115589e+05
GDP (2019) 9.485305e-07
Population (2019) 2.756669e-04
Total Renewable Energy Consumption -1.679993e+03
dtype: float64
=====================================================================================================================================================
Results of the stepwise selection: ['GDP (2019)', 'Total Renewable Energy Consumption', 'const']
=====================================================================================================================
Results of VIF:
Variable VIF
0 const 1.260093
1 GDP (2019) 51.923447
2 Population (2019) 3.100136
3 Total Renewable Energy Consumption 44.659468
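Interpretation
In the regression models applied to developing countries, GDP is the dominant and most statistically significant predictor of the variation in CO2 emissions. In 2019, total renewable energy consumption is also significant at the 5% level (p = 0.030), with a negative coefficient, while population is not (p = 0.173). The models explain 99.1% of the variance in 2019 and 99.2% in 2020.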
Status of the EKC Theory in 2019 and 2020
Across the regression models of developed and developing countries in 2019 and 2020, GDP was among the significant variables that best explain the variability in CO2 emissions. For both developed and developing countries, an increase in national income comes at the expense of the environment, and the effect is larger for developing countries: in 2019, the GDP coefficient is 9.49e-07 kt per US$ for developing countries versus 2.41e-07 kt per US$ for developed countries. Thus, in Figure 1, we can say that both developed and developing countries are on the upward-sloping part of the EKC.
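The slope comparison can be made a little more explicit by looking at the ratio of the two GDP coefficients and a rough z-score for their difference. The sketch below uses the 2019 GDP coefficients and standard errors printed in the regression summaries. The z-score is a common large-sample approximation that assumes the two samples are independent; with 27 and 42 observations it is only an informal check, not a test performed in this study.

```python
from math import sqrt


def slope_difference_z(b1, se1, b2, se2):
    """Approximate z-score for the difference between two independent slopes."""
    return (b1 - b2) / sqrt(se1 ** 2 + se2 ** 2)


# 2019 GDP coefficient and standard error from the OLS summaries
developing_gdp = (9.485e-07, 1.06e-07)
developed_gdp = (2.406e-07, 3.14e-08)

slope_ratio = developing_gdp[0] / developed_gdp[0]
slope_z = slope_difference_z(*developing_gdp, *developed_gdp)
developing_is_steeper = developing_gdp[0] > developed_gdp[0]
print(f"ratio = {slope_ratio:.2f}, z = {slope_z:.2f}, steeper = {developing_is_steeper}")
```

Both slopes are positive, which is what places the two groups on the rising part of the curve under the criterion used here.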
CONCLUSION
The study shows a clear link between a country's GDP and its CO2 emissions. High-income countries tend to have higher CO2 emissions per capita, supporting the idea that wealthier nations contribute more to global emissions. The analysis found that GDP is a key factor in predicting CO2 emissions in both developed and developing countries. In developed countries, GDP was the main factor affecting emissions, while in developing countries, the impact of GDP on emissions was even stronger. This supports the Environmental Kuznets Curve (EKC) hypothesis, suggesting that in poorer countries, economic growth initially leads to more environmental harm. However, the study also indicates that economic growth alone might not lead to environmental improvement. This highlights the need for targeted policies to ensure that economic development goes hand in hand with environmental sustainability.
RECOMMENDATIONS
Policy Recommendations
Differential Carbon Taxes: Implement a system where high-income countries are subject to higher carbon tax rates, while providing lower rates or subsidies for developing countries. This approach aims to balance economic growth with environmental responsibility.
Technology Transfer to Developing Countries: In the 2019 regression for developing countries, CO2 emissions were inversely related to the consumption of renewable energy. Therefore, assisting developing countries with high levels of CO2 emissions in adopting renewable energy technology is a logical step. Developed nations can contribute by sharing their technological advancements and providing guidance on implementation.
International Standards for Sustainable Development: Establish and enforce global standards for sustainable development, with specific targets for CO2 emission reduction and renewable energy adoption. Compliance could be encouraged through international benefits and recognition.
References
Pettinger, T. (2019, September 11). Environmental Kuznets curve - Economics Help. Economics Help. https://www.economicshelp.org/blog/14337/environment/environmental-kuznets-curve/
United Nations. (n.d.). The Paris Agreement. United Nations. https://www.un.org/en/climatechange/paris-agreement#:~:text=The%20Agreement%20is%20a%20legally
UNFCCC. (2015). The Paris Agreement. United Nations Climate Change; United Nations. https://unfccc.int/process-and-meetings/the-paris-agreement